Multistage intelligent database search method

ABSTRACT

An improved multistage intelligent database search method includes (1) a prefilter that uses a precomputed index to compute a list of most “promising” records that serves as input to the original multistage search method, resulting in dramatically faster response time; (2) a revised polygraph weighting scheme correcting an erroneous weighting scheme in the original method; (3) a method for providing visualization of character matching strength to users using the bipartite graphs computed by the multistage method; (4) a technique for complementing direct search of textual data with search of a phonetic version of the same data, in such a way that the results can be combined; and (5) several smaller improvements that further refine search quality, deal more effectively with multilingual data and Asian character sets, and make the multistage method a practical and more efficient technique for searching document repositories.

BACKGROUND OF THE INVENTION

[0001] 1. Field of the Invention

[0002] This invention relates to methods of database searching, and more particularly to improvements to a highly error-tolerant yet time-efficient search method based on bipartite weighted matching.

[0003] 2. Description of Related Art

[0004] Inexact or “fuzzy” string comparison methods based on bipartite matching are highly appropriate for finding matches to users' queries in a database, despite errors and irregularities that often occur in both queries and database records. Given the massive growth in both the quantity and availability of information on the world Internet, and the dependence of corporations, government agencies, and other institutions on accurate information retrieval, a pressing need exists for efficient database search methods that are highly error-tolerant, and that also function well on the vast quantity of “semi-structured” (loosely formatted) textual data that is available to users of the Internet and corporate intranets.

[0005] Such an error-tolerant database search method is the subject of U.S. Pat. No. 5,978,797 by Peter N. Yianilos, assigned to NEC Corporation, Inc., entitled “Multistage Intelligent String Comparison Method.” The heart of that invention is a software function that compares two text strings, and returns a numerical indication of their similarity. To approximate a more “human” notion of similarity than other approaches to inexact comparison, this function utilizes a bipartite matching method to compute a measure of similarity between the two strings. String comparison using bipartite matching is disclosed in U.S. Pat. No. 5,841,958 by Samuel R. Buss and Peter N. Yianilos, assigned to NEC Corporation, Inc. U.S. Pat. No. 5,978,797 discloses the application of bipartite matching-based string comparison to database search, in which the similarity of each database string to a query is computed based on an optimal weighted bipartite matching of characters and polygraphs (short contiguous stretches of characters) common to both database record and query.

[0006] This “multistage” search method operates on a database consisting of records, each of which is viewed simply as a string of characters, and compares each record with a query consisting of a simple free-form expression of what the user is looking for. The comparison process occurs in three stages, in which the earlier stages are the most time-efficient and eliminate many database records from further consideration. The final output is a list of database records ranked by their numerical “similarity” to the query. The multistage approach, which applies increasingly stringent and computationally intensive versions of bipartite matching to smaller and smaller sets of records, makes it possible to compare the query with thousands or hundreds of thousands of database records while still delivering acceptable response time. The result is that in almost all cases, the output list of database records is the same list that would be produced by applying the final and most discerning (but slowest) process stage to the entire database.

[0007] A number of weaknesses and unexploited potentialities are associated with the original multistage database search method disclosed in U.S. Pat. No. 5,978,797:

[0008] ORIGINAL METHOD WAS NOT SCALABLE TO LARGE DATABASES. A major weakness of the original method is that it must examine every character of every record in a database in order to determine a list of records most similar to a query. The original method is thus time-efficient only for small to medium-sized databases, consisting of tens or hundreds of thousands of records. The method is not scalable to large databases.

[0009] ORIGINAL METHOD DID NOT TAKE ADVANTAGE OF THE BIPARTITE GRAPH TO PROVIDE VISUAL FEEDBACK TO THE USER. The original method used the total cost of the bipartite matching of characters and polygraphs between query and database record as a measure of their similarity. This is a single number, which suffices for the ranking of records in the output list. However, the bipartite graph that is computed by the final filter stage contains information that can be used to provide sophisticated feedback to the user regarding the “matching strength” of each character in a database record.

[0010] ORIGINAL METHOD WRONGLY WEIGHTED MATCHING POLYGRAPHS OF DIFFERENT LENGTHS. The three stages of the multistage method compute bipartite matchings of single characters and polygraphs common to a query and a database record. Since a matching 6-graph (stretch of 6 characters) is clearly more significant than a matching 3-graph or 2-graph or 1-graph (single character), the original method adopted a weighting scheme that weighted matching polygraphs in direct proportion to their length. This approach was mistaken, and frequently resulted in a poor similarity ranking.

[0011] A more correct analysis of bipartite matching of polygraphs shows that longer polygraphs naturally receive greater weight in the overall matching due to the greater number of shorter polygraphs they contain, which are also included in the matching.

[0012] This natural weighting effect due to polygraph inclusion is already so pronounced that a correct weighting scheme should seek to attenuate it, not further magnify it, as did the original method. Under the original weighting scheme, database records containing many short matching polygraphs but no very long ones tended to be strongly outranked by records that happened to contain a single long matching polygraph. This frequently resulted in clearly less-similar records (in the judgment of a human being) outranking more-similar records.

[0013] ORIGINAL METHOD INCORPORATED NO KNOWLEDGE OF CHARACTER PHONETICS. Bipartite matching operating directly on English or other natural-language strings does not capture points of similarity that depend upon knowledge of character phonetics, e.g., that in English “ph” usually represents the same sound as “f”. While a typographic error in a query or database record generally substitutes an unrelated symbol for the correct one, misspellings often substitute a symbol (or symbols) that sound equivalent to the correct symbol. The original method incorporated no such language-specific phonetic knowledge, which frequently resulted in degraded search quality.

[0014] In summary, the original multistage search method does not scale to large databases, does not exploit the bipartite graph to provide any visual feedback to the user on which characters match his query, employed a faulty character and polygraph weighting scheme, and does not capture points of similarity with a query that depend on a knowledge of phonetics.

SUMMARY OF THE INVENTION

[0015] Briefly described, the invention comprises the following elements:

[0016] A “polygraph indexing prefilter” which serves as a front-end filter to the multistage search method. This prefilter operates using a precomputed index of all polygraphs of some single fixed length N (generally 3 or 4) that occur in the database. For each of these N-graphs, the index maintains a precomputed list of all records containing that polygraph.

[0017] When the user submits a search query, this query is resolved by the filter into the list of unique N-graphs that it contains. Using the precomputed index, the prefilter quickly determines a list of records sharing one or more N-graphs with the query. A maximum list size M_(r) (often about 5,000 records) is enforced, and since the prefilter keeps track of exactly how many N-graphs in each record are held in common with the query, it is able to return what are in effect the M_(r) most “promising” records.

[0018] This list of (at most) M_(r) records, normally a very small fraction of the whole database, then serves as input to the three stages of the original multistage search method. Since the prefilter does not actually examine any record, its operation is much faster than the later stages, each of which must examine every character of every input record.

[0019] The effect of this element of the invention is to make the multistage method scalable to databases considerably larger than before, typically millions or tens of millions of records.

[0020] Visualization of the matching strength of each character in the database records output by the multistage search method, using information contained in the bipartite graph matching characters and polygraphs in the database record and the query. Briefly, each character in an output record is by definition contained in zero or more polygraphs that are matched with polygraphs in the query. The lengths of such containing polygraphs, as well as their graph edge displacements, provide the basis for a quantitative measure of “matching strength” at that character position. The preferred embodiment uses the lengths of matching polygraphs containing each character position to determine an integral matching strength for each character in an output record. When the record is displayed to the user, each of its characters is highlighted in proportion to this matching strength using font-based highlighting techniques, including but not necessarily limited to font color, type size, type style, and underlining.

[0021] A revised scheme for weighting matching polygraphs in the multistage search method. The scheme used by the original method was wrong. The revised scheme weights each matching polygraph in inverse proportion to its length, thus attenuating but not wholly canceling the naturally greater weight (due to polygraph inclusion) of the longer polygraphs. The weighting factor used is the polygraph length raised to an exponent (typically negative). This exponent is a tunable parameter. Appropriate values for textual data are typically in the range −0.5 to −1.0.

[0022] A technique for complementing the direct application of the multistage method to textual data with a parallel application of the same method to an alternative (e.g., phonetic) representation of the same data. Thus, given a search query, two multistage searches are performed: one search with the given query against the untransformed database, and a second search with an alternative version of the query against a transformed version of the same database (e.g., a phonetic transcription of the query searched against a phonetically transcribed version of the database). A simple form of normalization ensures that the results of the two searches are comparable, and can be directly merged. If a record occurs in the output of both searches, the higher-scoring (more similar) occurrence determines the record's rank in the merged output list. In this way, a multistage search that handles typographic errors efficiently is complemented by a search that handles phonetic misspellings efficiently. The result is a significantly enhanced multistage search.

[0023] In addition, the invention comprises a number of smaller improvements that further refine search quality, deal more effectively with multilingual data and Asian character sets, and make the multistage method a practical device for searching document repositories.

[0024] The invention may be more fully understood by reference to the following drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

[0025] FIG. 1 illustrates the prior art.

[0026] FIG. 2 illustrates the preferred embodiment of the present invention.

[0027] FIG. 3A is a schematic illustration of the static data structures used by the GIP prefilter.

[0028] FIG. 3B is a schematic illustration of the dynamic data structures used by the GIP prefilter.

[0029] FIG. 3C is a flowchart of the operation of the GIP prefilter.

[0030] FIG. 4A illustrates match strength visualization using font-based highlighting techniques on a list of words.

[0031] FIG. 4B illustrates match strength visualization using font-based highlighting techniques on a list of paragraphs.

[0032] FIG. 5A illustrates the additional matching polygraphs present when a matching character stretch is increased in length by one character.

[0033] FIG. 5B illustrates an example of poor match ranking when there is no attenuation of the weighting of long polygraphs due to polygraph inclusion.

[0034] FIG. 5C is a plot of the relative contributions to the measurement of string similarity made by matching polygraphs of differing length, under several different weighting schemes.

[0035] FIG. 6 is a flowchart of the operation of dual phonetic and non-phonetic database search.

[0036] FIG. 7A illustrates a set of initial alignments of a query and a database record.

[0037] FIG. 7B is a flowchart of the improved realignment process.

[0038] FIG. 8 is a flowchart of the penalizing of database records based on alignment and record length.

[0039] FIG. 9 is a flowchart of the operation of a database search of non-alphabetic character data.

[0040] FIG. 10A is a schematic illustration of the preprocessed database structure used in a multistage document search.

[0041] FIG. 10B is a flowchart of the operation of multistage document search.

DETAILED DESCRIPTION OF THE INVENTION

[0042] During the course of this description like numbers will be used to identify like elements according to the different figures that illustrate the invention.

[0043] FIG. 1 is a schematic illustration of the prior art disclosed in U.S. Pat. No. 5,978,797.

[0044] FIG. 2 is a schematic illustration of the preferred embodiment 10 of the present invention. A transductive preprocessor 12 prepares a searchable version of the text of W database records. These are input to a polygraph indexing prefilter 13, or “GIP prefilter,” which serves as a front-end to the three filter stages F1, F2, and F3 comprising the original multistage method 14. These four filters, including the GIP prefilter, comprise a time-tapered sequence of filter stages, where the input of each filter is a set of database records, and the output is a subset of its input. The output of a filter is the input of the next filter in the sequence. Each filter applies a more stringent computation of “similarity” than the previous filter between the query and its input. Earlier filter stages are fastest and eliminate the greatest number of records from further consideration. Later stages are the most computationally intensive, but operate on much smaller subsets of records.

[0045] When a search query is submitted by a user, the largest set of W records is input to the GIP prefilter, which outputs Z records, where Z is normally a small fraction of W. The F1, F2, and F3 filters reduce the number of records to a final output list of X most-similar records, ranked by their degree of similarity to the query. The result of this tapered filters approach is that the final output list of X records is in almost all cases the same as would be produced by applying the final, most discerning (but slow) filter stage, F3, to the entire database of W records.

[0046] The F1 15 and F2 16 filter stages of the present invention are as in the prior art. In the preferred embodiment, F1 takes account only of counts of matching polygraphs, and F2 performs a bipartite matching with a “free” realignment, as disclosed in U.S. Pat. No. 5,978,797.

[0047] The F3 filter stage 17 is as in the prior art as to its basic steps, but some of these steps incorporate significant improvements in the present invention. Polygraphs common to query and record are identified as in the prior art 18. The computation of polygraph weights 19 corrects a major deficiency, a revised realignment process 20 addresses a frequent source of search failure, and the addition of alignment and record-length penalties 21 produces a more natural ranking of records that have essentially the same degree of similarity to the query.

[0048] Finally, a visualization postprocessor 22 computes a match strength for every character in the X output records using information contained in the bipartite graphs computed by F3, and displays the records with characters highlighted in proportion to their match strength.

[0049] (1) Polygraph Indexing Prefilter

[0050] The three filters F1, F2, and F3 examine every character of every record in their input. If the entire database is input to F1, then every character in the database will be examined in the course of every search. For this reason, the original method was time-efficient only for small and medium-sized databases (up to tens or hundreds of thousands of records, depending on record length). The purpose of the additional GIP prefilter is to make the multistage method scalable to larger databases containing millions or tens of millions of records.

[0051] FIG. 3A illustrates the data structures used by the GIP prefilter to reduce the complete set of database records to a list of typically several thousand most “promising” records. Let N be a fixed value, typically either 3 or 4. Every N-graph g_(i) of the G unique N-graphs occurring in the database is stored in a B-tree 30, together with a pointer 31 to a list R_(i) 32 of all records in the database that contain N-graph g_(i). The length l_(i) of each list is explicitly stored with the list for ease of access.

[0052] Generally speaking, record lists for the most common N-graphs in a database are not useful for determining the most promising records, especially when an N-graph occurs in the vast majority of database records. Hence, in the preferred embodiment, record lists are not actually built for N-graphs occurring in more than 75 to 80 percent of the records in the database. The B-tree pointer for such an N-graph contains a flag indicating that its frequency exceeds the chosen threshold.
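By way of illustration only, the static index of FIG. 3A might be sketched in Python as follows. The sketch substitutes an ordinary dictionary for the B-tree 30, and the names `ngrams`, `build_index`, `TOO_COMMON`, and `max_fraction` are hypothetical, chosen for this example rather than taken from the disclosure.

```python
from collections import defaultdict

# Hypothetical flag value standing in for the "too common" flag in the B-tree pointer.
TOO_COMMON = None


def ngrams(text, n):
    """Return the set of distinct N-graphs (length-n substrings) of text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def build_index(records, n=3, max_fraction=0.75):
    """Map each N-graph to the list of record numbers containing it.

    A plain dict is used here in place of a B-tree (an assumption of this
    sketch). N-graphs found in more than max_fraction of the records get
    the TOO_COMMON flag instead of a record list.
    """
    postings = defaultdict(list)
    for rec_no, text in enumerate(records):
        for g in ngrams(text, n):
            postings[g].append(rec_no)  # record numbers arrive in increasing order
    limit = max_fraction * len(records)
    return {g: (TOO_COMMON if len(recs) > limit else recs)
            for g, recs in postings.items()}


if __name__ == "__main__":
    index = build_index(["abcd", "abce", "abcf", "xbcd"], n=3, max_fraction=0.5)
    print(index["bcd"], index["abc"])
```

The length of each record list is available as `len(recs)`, which plays the role of the explicitly stored list length described above.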

[0053] These data structures of FIG. 3A are precomputed, and can be large. Hence the GIP prefilter is employed under circumstances where the need for speed in searching a large database warrants the increase in memory utilization.

[0054] FIG. 3B depicts the key search-time data structure used by the prefilter. It is a table 40 containing an entry for each record in the database, where each entry comprises three elements: an integer counter C, a forward pointer P1, and a backward pointer P2. The counter will count the number of distinct N-graphs each record shares with a query Q. The pointer elements enable table entries to be linked to each other in doubly linked lists. A dynamically expanding vector 41 is used to store the heads of such doubly linked lists L_(1), L_(2), L_(3), . . . , after they are created. Each record's table entry will be on at most one such doubly linked list at any given time.

[0055] FIG. 3C gives a flowchart of the operation of the GIP prefilter. The prefilter is invoked with three input parameters: a search query Q, a maximum number M_(l) of list items to traverse, and a maximum number M_(r) of records that are to be output by the prefilter. (Typical values for M_(l) and M_(r) are 10,000 records and 5,000 records, respectively.)

[0056] A query Q is resolved into a list G of its unique N-graphs 50. For each unique N-graph g in the query 52, the prefilter tries to look up g in the B-tree 54. If g is in the B-tree 56, the B-tree delivers the pointer to its associated record list R_(g) 58. The pointer to R_(g) is added to a list R of “relevant” record lists 60. (If the B-tree contains a flag value indicating that the N-graph was too common to have a record list, the N-graph is ignored.)

[0057] After the list R of relevant record lists is complete 62, R is sorted by increasing length of its component lists R_(g) 64. That is, the shortest record lists will be at the beginning of R, and the longest at the end of R. The point of sorting R is that the shortest record lists correspond to the least common, and hence most potentially significant, N-graphs that are held in common with the query. The prefilter will therefore traverse the record lists in order of increasing list length.

[0058] The prefilter now initializes the integer counters 66 for each database record in the table 40. Also, all pointers in the table are set to a null value.

[0059] The lists in R are now linearly traversed in order (shortest lists first) 68. For each record in each list 70, the associated counter value c is retrieved 72 and tested 74. If the counter value is zero, i.e., if this is the first time this record has been encountered on a list, the record is added to doubly linked list L_(1) (which is created if necessary) 76. The list L_(1) is the list of records that have been encountered exactly once by the prefilter.

[0060] If the counter value c is not zero, then this record has been encountered exactly c times before, and is on doubly linked list L_(c). It is removed from list L_(c) and inserted into list L_(c+1) (which is created if necessary) 78. List removal and insertion are fast, since the list item for a given record is quickly found in the table 40, which is indexed by record.

[0061] The counter c in the table 40 is now incremented 80.

[0062] After processing each record on a record list, the prefilter checks to see if it has traversed M_(l) records 82. If so, it exits the record list traversal loops.

[0063] After record list traversal is complete 84, 86, the prefilter builds its output record list L from lists L_(i), starting with the list with the maximum value of i 88. Thus it outputs first those records containing the most distinct N-graphs in common with the query. It proceeds until either the lists L_(i) are exhausted or the maximum M_(r) of output records is reached 90. For each record in a list 92, as long as the maximum M_(r) is not exceeded 94, the record is added to the output list L 96. When the list is exhausted 98, the prefilter decrements i 100 and processes the next list L_(i). When the lists are exhausted or the maximum M_(r) is reached 102, the list L is output 104. The output list L of the GIP prefilter is a list of the (at most) M_(r) “most promising” records. These records alone are passed to the F1 filter.
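The flow of FIG. 3C may be sketched, under stated assumptions, as the following Python function. The index is assumed to be a mapping from N-graph to a list of record numbers, with `None` standing in for the "too common" flag. Python dictionaries used as ordered sets replace the doubly linked lists of table 40, since both allow constant-time removal and insertion. The names `gip_prefilter`, `max_items`, and `max_records` are hypothetical and correspond to M_(l) and M_(r).

```python
def gip_prefilter(query, index, num_records, n=3,
                  max_items=10_000, max_records=5_000):
    """Return up to max_records record numbers, most shared N-graphs first."""
    # Resolve the query into its unique N-graphs (sorted only for determinism).
    grams = sorted({query[i:i + n] for i in range(len(query) - n + 1)})

    # Collect the relevant record lists; absent or flagged N-graphs are ignored.
    relevant = [index[g] for g in grams if index.get(g) is not None]
    relevant.sort(key=len)  # shortest list (rarest N-graph) first

    counts = [0] * num_records  # counter c for each database record
    buckets = {}                # c -> ordered set of records seen exactly c times
    traversed = 0
    for rec_list in relevant:
        if traversed >= max_items:
            break
        for rec in rec_list:
            c = counts[rec]
            if c:
                del buckets[c][rec]                 # remove from list L_c
            buckets.setdefault(c + 1, {})[rec] = None  # insert into L_(c+1)
            counts[rec] = c + 1
            traversed += 1
            if traversed >= max_items:
                break

    # Emit records from the highest count downward until max_records is reached.
    output = []
    for c in sorted(buckets, reverse=True):
        for rec in buckets[c]:
            if len(output) == max_records:
                return output
            output.append(rec)
    return output


if __name__ == "__main__":
    toy_index = {"abc": [0, 1, 2], "bcd": [0, 3], "cde": [0]}
    print(gip_prefilter("abcde", toy_index, num_records=4))
```

In this toy index, record 0 shares three N-graphs with the query and is therefore emitted first; the remaining records each share one.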

[0064] The advantage of the GIP prefilter is that a large database of W records is reduced very quickly to a most promising subset of Z records, typically several thousand. This prefilter is fast because it essentially retrieves precomputed lists of records. The result is that, for any particular search query, the bulk of a large database need not be traversed by the later filter stages that must examine every character of every input record.

[0065] It should be understood that what is claimed here is a polygraph indexing method used as a prefilter for the multistage database search method, and not as the primary search engine.

[0066] (2) Visual Feedback Based on Bipartite Graphs

[0067] FIGS. 4A and 4B illustrate the visualization of per-character match strengths using font-based highlighting techniques. In these examples, type size and emboldening are used to indicate the relative match strength of each character in each displayed database record. Other typical techniques include the use of a set of colors to indicate match strength (e.g., red for strongest match intensity, shading to blue for weaker match intensities).

[0068] Any method of highlighting characters that communicates to the user a relative match strength at each displayed character position falls within the spirit and scope of this invention, including but not necessarily limited to the use of colored fonts, typefaces of differing size or style, and underlining.

[0069] Regardless of the visualization techniques used, the determination of the match strength of a given character in a database record is based upon the numerical contribution made by polygraphs including that character to the total cost of the bipartite matching. The bipartite graphs computed by the F3 filter in the multistage search contain this information.

[0070] If a given database character belongs to no matching polygraphs, its match strength is zero by definition. If it belongs to one or more matching polygraphs, its match strength may be thought of as some composite function of the lengths of those polygraphs, any weighting factors assigned to them, and their graph edge displacements.

[0071] In practice, only a small number of match strengths need be discriminated for purposes of providing effective visual feedback to the user. Since the range of polygraph lengths utilized by the search method is normally 1-graphs (i.e., single characters) to 6-graphs, the preferred embodiment of the invention ignores edge displacements, and assigns a match strength to each character equal to the length of the longest matching polygraph containing that character. These six match strengths are then rendered using different color shades for displaying the character, or other highlighting techniques such as those shown in FIGS. 4A and 4B.
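The match-strength rule of the preferred embodiment can be sketched as follows. The sketch assumes that the matched polygraphs of a record are available as hypothetical `(start, length)` pairs giving each polygraph's position in the record; how those pairs are extracted from the bipartite graph is not shown.

```python
def match_strengths(record_length, matched_polygraphs):
    """Return one integer per character position of a record.

    Each strength is the length of the longest matched polygraph covering
    that position, or 0 if no matched polygraph covers it. Edge
    displacements are ignored, as in the preferred embodiment.
    """
    strengths = [0] * record_length
    for start, length in matched_polygraphs:
        for pos in range(start, start + length):
            strengths[pos] = max(strengths[pos], length)
    return strengths


if __name__ == "__main__":
    # A 3-graph at positions 0-2, a 2-graph at 1-2, and a 1-graph at 4.
    print(match_strengths(6, [(0, 3), (1, 2), (4, 1)]))
```

A display layer would then map each strength value to a font size, weight, or color shade.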

[0072] (3) Attenuation of the Effect of Polygraph Inclusion on Matching

[0073] The original multistage search method wrongly weighted matching polygraphs of different lengths when computing bipartite matchings. Since a matching 6-graph (stretch of 6 characters) is clearly more significant than a matching 3-graph or 2-graph or 1-graph (single character), the original method adopted a weighting scheme that weighted matching polygraphs in direct proportion to their length. This approach was mistaken, and frequently resulted in a poor similarity ranking. The original method overlooked the fact that the contribution of longer polygraphs to the matching is already naturally magnified due to polygraph inclusion.

[0074] FIG. 5A illustrates the meaning of polygraph inclusion. Consider two database records R1 and R2 which contain, respectively, a 4-character and a 5-character stretch in common with a query Q. Thus, R2 has one more character in common with query Q than R1. However, as illustrated in the drawing, there are actually five additional polygraphs that R2 has in common with the query, and all five of these additional polygraphs will contribute to the bipartite matching.

[0075] Hence, the effect on the overall matching of the single additional matching character in R2 is magnified by the fact that this additional character is contiguous with a stretch of other matching characters, and the magnification effect will be greater the longer the stretch of characters is. In general terms, a new matching character added to a stretch of matching characters of length N results in N+1 new matching polygraphs (assuming that N+1 still lies within the range of polygraph lengths considered for the bipartite matching). Thus, a matching stretch lengthened from 1 to 2 results in 2 new matching polygraphs (a 2-graph and a 1-graph), a matching stretch lengthened from 3 to 4 results in 4 new matching polygraphs (a 4-graph, a 3-graph, a 2-graph, and a 1-graph), etc.

[0076] This analysis shows that as the length of a stretch of matching characters increases, its contribution to the overall matching becomes very weighty, even without the application of further length-based weighting factors. In fact, in the preferred embodiment of the multistage bipartite matching filters, this contribution increases in proportion to the square of the length of the matching stretch.

[0077] In the original method, this natural weighting effect was magnified still further by a linear weighting scheme that multiplied the contribution of a polygraph by a weighting factor equal to its length. The effect of this weighting scheme was that the total contribution of a stretch of matching characters to the overall matching increased in proportion to the cube of its length.
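The square and cube growth rates can be checked with a short count. As a simplifying assumption for illustration, suppose a stretch of N matching characters (with N within the range of polygraph lengths considered) contains N−k+1 matching k-graphs, and that each matching polygraph contributes one unit before weighting. Writing C_0(N) for the unweighted total and C_1(N) for the total under the original length-proportional weights:

```latex
\[
C_0(N) = \sum_{k=1}^{N} (N-k+1) = \frac{N(N+1)}{2},
\qquad
C_1(N) = \sum_{k=1}^{N} k\,(N-k+1) = \frac{N(N+1)(N+2)}{6}.
\]
```

The first total grows as the square of N and the second as the cube of N. For example, N = 3 gives C_0 = 6 and C_1 = 10.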

[0078] FIG. 5B shows an example of poor match ranking under the original weighting scheme.

[0079] Record R1 contains a 5-character stretch (“phone”) in common with the query “Vodaphone.” Record R2 contains only a 4-character stretch in common (“Voda”). In spite of the fact that R2 is obviously more similar to the query, R1 is ranked higher, owing to the powerful effect of the longer matching character sequence.

[0080] Experiment shows that a correct polygraph-weighting scheme should not magnify the natural weighting effect due to polygraph inclusion, but instead attenuate it somewhat. The revised scheme in the present invention weights each matching polygraph in inverse proportion to its length, attenuating but not wholly canceling the naturally greater weight of longer stretches of matching characters. The weighting factor used is the polygraph length raised to a negative exponent, but other functions may be used as well. This exponent is a tunable parameter. Appropriate values for textual data are typically in the range −0.5 to −1.0.

[0081] FIG. 5C plots the relative contributions to the overall measure of string similarity by matching character stretches of lengths 1 to 6, for several different values of the exponent parameter. An exponent of 1 corresponds to the weighting scheme used in the original method. An exponent of 0 corresponds to the situation of no weighting in addition to the natural effect of polygraph inclusion. An exponent of −1 attenuates this natural effect, so that the contribution of a stretch of N matching characters increases in proportion to N log N.
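The kind of comparison described for FIG. 5C can be approximated numerically with the sketch below. It is not a reproduction of the figure. It assumes a simplified model in which a stretch of n matching characters contributes one unit per contained k-graph, multiplied by the weight k raised to the exponent, and it ignores edge displacement costs. The function name `stretch_contribution` and the parameter `max_len` are hypothetical.

```python
def stretch_contribution(n, exponent, max_len=6):
    """Total weighted contribution of a stretch of n matching characters.

    The stretch contains n - k + 1 polygraphs of each length k, for k up to
    min(n, max_len); each is weighted by k ** exponent (simplified model).
    """
    return sum((n - k + 1) * k ** exponent
               for k in range(1, min(n, max_len) + 1))


if __name__ == "__main__":
    for exponent in (1, 0, -0.5, -1):
        row = [round(stretch_contribution(n, exponent), 2) for n in range(1, 7)]
        print(f"exponent {exponent:>4}: {row}")
```

Under this model an exponent of 1 yields cubic growth, 0 yields quadratic growth, and −1 yields the slower growth of roughly N log N noted above.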

[0082] The result of weighting polygraphs by their length raised to a negative exponent is that records that have many smaller matching stretches of characters, but few or no longer ones, have a better chance of ranking high in the list of matching records. This rectifies problems such as that exhibited in FIG. 5B.

[0083] (4) Phonetic Search using Transductive Preprocessing

[0084] The original multistage search method incorporated no knowledge of character phonetics. Bipartite matching operating directly on English or other natural-language strings does not capture points of similarity that depend upon knowledge of character phonetics, e.g., that in English “ph” usually represents the same sound as “f”. While a typographic error in a query or database record generally substitutes an unrelated symbol for the correct one, misspellings often substitute a symbol (or symbols) that sound equivalent to the correct symbol. The original method incorporated no such language-specific phonetic knowledge, which frequently resulted in degraded search quality.

[0085] The present invention allows a query to be compared with a single database represented in more than one form, e.g., its normal human-readable form and a phonetic form. This requires that the records in a database be transformed into an alternative phonetic representation.

[0086] The transformation of a string of symbols into an alternative representation is called “string transduction.” “Transductive preprocessing” is a software method that is used to prepare alternative representations of a query string or a database string, so that they can be operated on in that form by the multistage search method. The present invention employs a transductive preprocessor to enable comparison of a search query with the records in a database, based upon a phonetic transduction of both query and database records.

[0087] A phonetic transduction of the searchable portion of each database is created using an automaton that translates character strings into a phonetic alphabet. Note that this translation is not a one-to-one character mapping, since multiple letters can represent a single sound (e.g., English “th”), and a single letter can represent multiple sounds (e.g., English “x”). Further, immediate context can alter a sound (e.g., English “c”, which is soft when preceding an “e” or an “i”). Such rules for transcribing character strings in a given language into a phonetic alphabet are known collectively as a “phonetic grammar,” and using standard art an automaton A is created which outputs a phonetic transduction of an input string. This automaton is used to preprocess every database record into its phonetic equivalent. The phonetic version of the database is precomputed and stored with the non-phonetic version.
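As a toy illustration of such a transduction, the following Python sketch applies a handful of rewrite rules covering only the examples mentioned above. The rule set, the output symbols (for instance "T" for the sound of "th"), and the function name `transduce` are hypothetical. A real phonetic grammar would be far larger, and the disclosure does not specify one.

```python
# Hypothetical multi-letter and one-to-many rules; NOT a real phonetic grammar.
RULES = [("ph", "f"), ("th", "T"), ("x", "ks")]


def transduce(text):
    """Return a toy phonetic transduction of text, scanning left to right."""
    text = text.lower()
    out = []
    i = 0
    while i < len(text):
        for src, dst in RULES:
            if text.startswith(src, i):
                out.append(dst)
                i += len(src)
                break
        else:
            ch = text[i]
            if ch == "c":
                # Context rule: "c" is soft before "e" or "i", hard otherwise.
                nxt = text[i + 1:i + 2]
                out.append("s" if nxt in ("e", "i") else "k")
            else:
                out.append(ch)
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(transduce("Vodaphone"), transduce("Vodafone"))
```

Because "ph" and "f" map to the same output symbol, a query misspelled phonetically and the correctly spelled record become identical after transduction.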

[0088] FIG. 6 gives a flowchart of the operation of a dual phonetic and non-phonetic database search. When a query Q is submitted, a standard search is performed against the non-phonetic database D, returning a ranked list of matching records T 120.

[0089] Next the query Q is translated into its phonetic equivalent Q_(p) using the same automaton A that was used in the phonetic transduction of the database 122. Then a search is performed for Q_(p) against the phonetic version of the database D_(p), returning a ranked list of matching records P 124.

[0090] It is necessary that the matching costs of the records on lists T and P be comparable. It is sufficient for this purpose that the “padding” length used in each of the two individual searches to compensate for variations in database record lengths be the same value. This value can be set to the greater of the two padding lengths that would be used if the two searches were performed alone.

[0091] The two lists T and P are now merged into one list L 126. Since the same record can appear twice in this merged list, the merged list is processed to remove any lower-ranked duplicate records that occur. One simple method of removing duplicates is to sort L by record number 128, which brings any duplicate records together in the list. It is then easy to traverse the sorted list linearly, noting adjacent elements that represent the same record in two forms (non-phonetic and phonetic). The lower-ranking duplicate (i.e., the one with the higher match cost) is removed from the list 130. Then the list is re-sorted by match cost 132, producing a single duplicate-free ranked list that is the output 134.
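The merge and duplicate-removal steps can be sketched as follows. The tuple layout `(record number, match cost)` and the name `merge_ranked` are assumptions made for the example. As a small variation on the description, the sort key here is record number and then cost, so that the lower-cost copy of a duplicated record always comes first and the later copy can simply be skipped.

```python
from typing import List, Tuple

Hit = Tuple[int, float]  # (record number, match cost); lower cost ranks higher


def merge_ranked(t: List[Hit], p: List[Hit]) -> List[Hit]:
    """Merge non-phonetic hits t and phonetic hits p into one duplicate-free list."""
    # Sort by record number so duplicates become adjacent; within a record,
    # the lower-cost (higher-ranking) copy sorts first.
    merged = sorted(t + p, key=lambda h: (h[0], h[1]))
    deduped: List[Hit] = []
    for hit in merged:
        if deduped and deduped[-1][0] == hit[0]:
            continue  # same record seen already with a lower or equal cost
        deduped.append(hit)
    # Re-sort by match cost to restore a ranked list.
    deduped.sort(key=lambda h: h[1])
    return deduped


T = [(7, 0.10), (3, 0.25), (9, 0.40)]
P = [(3, 0.05), (5, 0.30), (7, 0.20)]
print(merge_ranked(T, P))
```

The comparison of costs across T and P is meaningful only if both searches used the same padding length, as required above.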

[0092] (5) Additional Improvements to the Multistage Database Search Method

[0093] 5.1 Improved Query Realignment Process

[0094] The F3 filter stage aligns the left end of the query over the left end of the database record, performs a bipartite matching of letters and polygraphs, and then picks a new alignment of query and record based on the average length of the bipartite graph edges. This process is iterated, computing a new better matching in each iteration. The goal of this realignment process is to find the query-record alignment that produces the lowest-cost bipartite matching. Thus the process is iterated until it fails to produce a better matching, or until some predefined maximum number of realignments is reached.

[0095] The realignment process of the F3 filter stage in the original method does not in every case discover the globally optimal alignment of query and record. Occasionally it finds only a “local” optimum. In such a case, the F3 filter stage will underestimate the similarity of the record with the query, and will assign the record too low a rank in the list of matching records.

[0096] FIGS. 7A and 7B illustrate the new approach to realignment in the present invention. The new F3 filter selects a number of different “initial” alignments a₁, a₂, a₃, . . . , including left-alignment, right-alignment, and one or more intermediate alignments of query and record 140. For each initial alignment a_(i) 142, a small number (typically 1-2) of realignment iterations are performed, consisting of a bipartite matching producing a graph G_(i) 144, an adjustment of the alignment by the average graph edge displacement in G_(i) 146, and a re-computation of the bipartite matching and its graph 148. When this is completed for each initial alignment 150, the adjusted alignment A whose graph cost C_(i) is lowest is chosen as the most promising alignment 152, along with its graph G_(i) 154. Then the realignment iterations continue 156 with this adjusted alignment as in the prior art, until a maximum total number of realignments M_(r) have been performed. That is, A is adjusted based on the average edge displacement in G 158, then the matching and its graph are re-computed 160. If the match cost C has not increased 162, the next iteration is allowed to proceed 164. If C has increased (i.e., gotten worse), the realignment loop exits before completing M_(r) iterations, and the values of G and C of the previous iteration are restored 166 and output 168.
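The control flow of this multi-start realignment can be sketched as below. The bipartite matching itself is not reproduced: the `matcher` callable is a hypothetical stand-in that returns a match cost and an average edge displacement for a given alignment. Counting the probing iterations toward the total M_(r) (`max_total`) and stopping early when the displacement rounds to zero are assumptions of this sketch.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical interface: alignment -> (match cost, average edge displacement).
Matcher = Callable[[int], Tuple[float, float]]


def realign(matcher: Matcher, initial: Iterable[int],
            probe_iters: int = 1, max_total: int = 8) -> Tuple[int, float]:
    """Probe several initial alignments, then refine the most promising one."""
    used = 0
    best = None  # (cost, adjusted alignment, displacement)
    for a in initial:
        cost, disp = matcher(a)
        for _ in range(probe_iters):      # a few realignments per start
            a += round(disp)
            cost, disp = matcher(a)
            used += 1
        if best is None or cost < best[0]:
            best = (cost, a, disp)
    if best is None:
        raise ValueError("at least one initial alignment is required")
    cost, a, disp = best
    # Continue realigning from the best adjusted alignment.
    while used < max_total and round(disp) != 0:
        cand = a + round(disp)
        cand_cost, cand_disp = matcher(cand)
        used += 1
        if cand_cost > cost:              # got worse: keep previous values
            break
        a, cost, disp = cand, cand_cost, cand_disp
    return a, cost


def toy_matcher(a: int) -> Tuple[float, float]:
    """Toy cost surface: local optimum at 2 (cost 5), global optimum at 20 (cost 1)."""
    target, base = (2, 5) if a < 10 else (20, 1)
    return base + abs(a - target), target - a


print(realign(toy_matcher, [0]))          # left-alignment only
print(realign(toy_matcher, [0, 15, 30]))  # left, intermediate, right
```

With the toy cost surface, starting from left-alignment alone settles on the local optimum, while adding intermediate and right alignments reaches the lower-cost one.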

[0097] With this new approach to alignment, it is far more likely that in situations such as that depicted in FIG. 7A, the search method will discover the optimum alignment of the query over the record string. In this example, the new approach ensures that at the end of the realignment process, the query “French-English” will be positioned over the most-similar middle portion of the record, rather than over the less-similar string “Spanish-English” near the beginning of the record.

[0098] 5.2 Privileging of Matches Near the Beginning of a Record, and Matches of Shorter Records

[0099] If two or more database records are determined to have exactly the same level of similarity to a given query, it usually makes sense to favor records in which the preponderance of matching polygraphs occur earlier in the record. This is because information near the beginning of a database record is often of greater importance or relevance than information further downstream. For example, in a list of employees, the name field is generally most “significant”, and tends to precede other fields like home address and telephone number. Hence, for many applications it would make sense to privilege matches near the beginning of a record in some way. The original multistage search method did not do this.

[0100] In a similar vein, if two or more database records are determined to have exactly the same level of similarity to a given query, it usually makes sense to favor records that are shorter. This is because shorter records can be regarded as more “similar” to the query, inasmuch as they contain (by definition) fewer characters that are unmatched. The original method did not privilege shorter records in this way.

[0101] The present invention adds a final step to the F3 filter stage that promotes matches near the beginning of records, as well as shorter records. “Matches near the beginning of a record” here means: records for which the final query alignment chosen by F3 is at or close to the beginning of the record (at or close to left-alignment).

[0102] After the optimal polygraph matching has been determined between a record and the query, two small penalty values are added to the total match cost of the record. One is based on the final query alignment chosen by F3, and is called the “alignment penalty.” The other is based on the record length, and is called the “record-length penalty.” The total penalty added to the match cost is small enough to affect the ranking only among records that have exactly the same similarity with the query (as expressed by the match cost). The two penalty values themselves are calibrated in such a way that the alignment penalty takes precedence over the record-length penalty. In any group of output records having exactly the same total match cost, the alignment penalty will cause the records to be sorted according to query alignment. Records having both the same match cost and the same query alignment will occur adjacent to each other, and the record-length penalty will cause this subset of records to be sorted according to increasing record length. This generally results in the most natural-seeming order of records in the output list.

[0103] In order to ensure that the total penalty added to the match cost of a record does not affect the ordering of records that have different match costs, the penalties are scaled by a small value x, which is chosen to be less than the minimum possible cost difference ΔC_(min). This minimum cost difference turns out to be the minimum of the weighting factors applied for each polygraph length. For example, if the weights are calculated as described above, they are equal to the polygraph length itself raised to some negative exponent e. If the polygraphs considered are 1-graphs to 6-graphs, and the exponent e is −1, then ΔC_(min) will be ⅙≈0.167.

[0104] FIG. 8 is a flowchart of the operation of the postprocessor that penalizes records with respect to query alignment and record length. For each record r in the return list L 170, the alignment penalty x*(1−1/P)*(A_(r)/P) is added to the total match cost C_(r) 172, where x is the scaling factor discussed above, A_(r) is the final query alignment value chosen by F3, and P is the length to which the query and all records were padded by the search (a value greater than or equal to the maximum record length plus the query length). Then the record-length penalty x*(1/P)*(L_(r)/P) is added to C_(r) 174, where x is the same scaling factor, L_(r) is the length of the record, and P is the same padding length. When matching costs have been penalized for all records this way 176, the list L is re-sorted by increasing match costs 178 and output 179.
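The two penalty formulas can be tried out with a short sketch. Because A_(r) and L_(r) are each at most P, the two penalties together never exceed x*(1−1/P) + x*(1/P) = x, which is why choosing x below ΔC_(min) leaves the order of records with different match costs untouched. The record tuple layout, the function names, and the choice x = ΔC_(min)/2 are assumptions made for this example.

```python
from typing import List, Tuple

# (record id, match cost C_r, final alignment A_r, record length L_r)
Record = Tuple[str, float, int, int]


def min_cost_difference(max_polygraph_len: int, exponent: float) -> float:
    """Smallest polygraph weight k**exponent over lengths 1..max_polygraph_len."""
    return min(k ** exponent for k in range(1, max_polygraph_len + 1))


def apply_penalties(records: List[Record], pad_len: int,
                    x: float) -> List[Tuple[str, float]]:
    """Add alignment and record-length penalties, then re-sort by cost."""
    out = []
    for rec_id, cost, align, length in records:
        alignment_penalty = x * (1 - 1 / pad_len) * (align / pad_len)
        length_penalty = x * (1 / pad_len) * (length / pad_len)
        out.append((rec_id, cost + alignment_penalty + length_penalty))
    out.sort(key=lambda r: r[1])
    return out


delta = min_cost_difference(6, -1)   # 1-graphs to 6-graphs, e = -1
x = delta / 2                        # any value below delta would do
records = [("a", 1.0, 10, 30), ("b", 1.0, 0, 40),
           ("c", 1.0, 0, 20), ("d", 0.9, 50, 60)]
print([rec_id for rec_id, _ in apply_penalties(records, pad_len=100, x=x)])
```

In this example record “d” stays first because its match cost is lower, and the three tied records are ordered by alignment first and by length second.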

[0105] 5.3 Searching Data Represented in Non-alphabetic Character Sets

[0106] Writing systems for Asian languages like Japanese are generally non-alphabetic, so it is not immediately apparent how a bipartite character-matching method for searching should be applied to data represented in an Asian or other non-alphabetic symbol set.

[0107] However, transducers that convert Asian character data into phonetic or other alphabetic equivalents exist, and can be developed using standard art. Such a transducer A can be used to preprocess a database D into an alphabetic equivalent D_(t). The transducer A preserves a mapping between the characters in the Asian language and their alphabetic equivalents, so that it is known which alphabetic output characters correspond to which non-alphabetic input characters. This mapping need not be stored in the computer's main memory, since it is useful mainly for visualization purposes, after the ranked record list is generated.

[0108] FIG. 9 is a flowchart of the operation of a multistage database search of non-alphabetic data. A query Q expressed in a non-alphabetic character set is translated into an alphabetic equivalent Q_(t) using the same transducer A that was used to produce D_(t) 180. It is then compared against the alphabetic records in D_(t), producing an output list of alphabetic records R_(t) 182. Then the non-alphabetic versions of the records in R_(t) are retrieved from D 184, and the resulting list R is output 186.

[0109] 5.4 Postprocessing of Bipartite Graphs to Refine Match Quality

[0110] The bipartite graphs output by the F3 filter stage can be used to refine search quality, either by adjusting the contribution of certain graph edges to the total match cost of a record, or by making small alterations to the graph itself. After this graph post-processing, the match scores of output records are recalculated, resulting in possible reordering of the results list.

[0111] An example of the usefulness of such post-processing occurs with the matching of accented characters or characters with diacritics. In general, it is desirable that the multistage search method treat accented and unaccented forms of the same character as equivalent, so that a query lacking the accent will still be able to match perfectly a database record containing the query string in an accented form. Similarly, an accented query should be considered to match perfectly a record containing that string in an unaccented form. However, it often happens that both accented and unaccented versions of the same string occur in a database, and it is almost always true that accented and unaccented versions of the same character occur in a single record. In these situations, it can be desirable to penalize slightly the contributions of graph edges connecting accented with unaccented forms, or even to alter the graph edges so that a better matching is attained with respect to accents and diacritics.
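One way to penalize edges that join accented and unaccented forms is sketched below. The edge representation (pairs of character positions), the penalty size of 0.01, and the function names are assumptions for illustration. The sketch also assumes the strings hold precomposed characters, so that each accented letter occupies a single position.

```python
import unicodedata
from typing import Iterable, Tuple


def strip_accents(text: str) -> str:
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def accent_penalty(query: str, record: str,
                   edges: Iterable[Tuple[int, int]],
                   penalty: float = 0.01) -> float:
    """Sum a small penalty over edges joining accented and unaccented forms.

    edges holds hypothetical (query index, record index) pairs taken from the
    bipartite graph of single-character matches.
    """
    total = 0.0
    for qi, ri in edges:
        q, r = query[qi], record[ri]
        # Same base letter, but the two characters differ only by diacritics.
        if q != r and strip_accents(q) == strip_accents(r):
            total += penalty
    return total


edges = [(0, 0), (1, 1), (2, 2), (3, 3)]
print(accent_penalty("cafe", "caf\u00e9", edges))  # one accented/unaccented edge
print(accent_penalty("cafe", "cafe", edges))       # no such edges
```

Adding this value to a record's match cost lets an exact accent match rank ahead of an otherwise identical match that differs only in diacritics, provided the penalty is kept small relative to real cost differences.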

[0112] Any postprocessing of bipartite graph output that adjusts the graph edge contributions to the total match cost or adjusts the graph itself for purposes of enhancing search quality, falls within the spirit and scope of this invention.

[0113] 5.5 Multistage Document Search Method

[0114] The multistage method compares a query with a set of strings, which is denoted a “database” of “records.” A document can be assimilated to this model if a “record” is defined to be a paragraph or other such natural document subdivision. The multistage method can then be straightforwardly applied to searching a single document for paragraphs containing a high degree of similarity to a query.

[0115] Often, documents are collected into repositories. For example, the textual component of a website can be viewed as a collection of interconnected documents, where each document is an HTML page.

[0116] The present invention adapts the multistage method for searching document collections. In this adaptation, the “database record” is a paragraph or other natural subdivision of a document. If a paragraph is excessively long, 1,000 characters being a typical threshold, it may be split into as many database records as necessary for the size of each part to fall below the threshold.

[0117] FIG. 10A illustrates the internal preprocessed “database” 190 that is the complete set of paragraphs contained in all documents in the collection. The method retains the relation of these paragraph-records to the larger document of which they are parts, by storing together with the text 192 of each paragraph-record a non-searchable string 194 that “links” that paragraph-record to its source document. Typically, this non-searchable string is the web URL of the whole source document, or a lookup string that will retrieve the whole source document from a database management system. The sequence number 196 of the paragraph-record in the source document is also stored with the paragraph-record.
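A paragraph-record of this kind can be sketched as a small data structure. The names `ParagraphRecord` and `build_records`, the use of blank lines as paragraph boundaries, the fixed-offset splitting of long paragraphs, and the reuse of the paragraph's sequence number for each of its pieces are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ParagraphRecord:
    text: str  # searchable paragraph text
    link: str  # non-searchable link to the source document (e.g. a URL)
    seq: int   # sequence number of the paragraph in the source document


def split_long(paragraph: str, limit: int = 1000) -> List[str]:
    """Cut a paragraph into pieces of at most `limit` characters."""
    return [paragraph[i:i + limit] for i in range(0, len(paragraph), limit)]


def build_records(document: str, link: str,
                  limit: int = 1000) -> List[ParagraphRecord]:
    """Turn one document into paragraph-records tied to their source."""
    paragraphs = [p.strip() for p in document.split("\n\n") if p.strip()]
    records = []
    for seq, para in enumerate(paragraphs):
        for part in split_long(para, limit):
            records.append(ParagraphRecord(part, link, seq))
    return records


doc = "short one\n\n" + "x" * 2500
for rec in build_records(doc, "http://example.com/doc"):
    print(rec.seq, len(rec.text), rec.link)
```

A production splitter would more likely break at sentence or word boundaries; cutting at fixed offsets merely keeps the sketch short.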

[0118] FIG. 10B illustrates the operation of multistage document search. A query Q is compared against a database D of paragraph-records according to the preferred embodiment of the invention 200. The output is a ranked list L of paragraph-records, together with the links to their source documents and their sequence numbers in those documents 202. Paragraph-records in L originating from the same source document are now reduced to the highest-ranking paragraph, as follows. For each source document S represented in L 204, the subset of records R_(s) in L that belong to S is identified 206. Then the highest-ranking (lowest-cost) record r in this subset is identified 208, and all other records in R_(s) are removed from L, preserving the order of all remaining records in L 210. When all source documents represented in L have been processed in this way 212, the resulting list L of paragraphs is displayed to the user, together with active links to the source documents 214.
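Because L is already ranked by increasing match cost, the first paragraph-record encountered for each source document is its highest-ranking one, so the reduction can be sketched as a single pass that has the same effect as the per-document procedure described above. The tuple layout and the name `best_per_document` are assumptions for illustration.

```python
from typing import List, Tuple

# (link to source document, paragraph sequence number, match cost)
ParagraphHit = Tuple[str, int, float]


def best_per_document(ranked: List[ParagraphHit]) -> List[ParagraphHit]:
    """Keep only the highest-ranking paragraph per source document.

    `ranked` must already be sorted by increasing match cost; the relative
    order of the surviving records is preserved.
    """
    seen = set()
    out: List[ParagraphHit] = []
    for hit in ranked:
        link = hit[0]
        if link in seen:
            continue  # a better paragraph from this document was kept already
        seen.add(link)
        out.append(hit)
    return out


L = [("doc-a", 4, 0.10), ("doc-b", 1, 0.15), ("doc-a", 2, 0.20),
     ("doc-c", 7, 0.30), ("doc-b", 3, 0.35)]
print(best_per_document(L))
```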

[0119] Thus the results list presented by the multistage document search is actually a list of documents, each represented by one salient matching paragraph. When the results list is presented to the user (in the usual case via a web interface), the user sees not only the salient matching paragraphs with matching characters highlighted, but also links to the whole source document. Clicking on a link 216 associated with a paragraph-record r takes the user to its source document S. The multistage document search combines visual match strength feedback with static or dynamic processing of the source document, so that the user is placed in the source document at the location of the matching paragraph-record, with the matching characters in that record highlighted.

[0120] Placing the user in the proper context in S is accomplished in one of two ways depending on whether S itself has changed since the time it was preprocessed into paragraph-records 218. If S has changed, a warning is issued to the user 220, since the highest-ranking paragraph in S may now be different from what the user expects. The modified S is now re-parsed into paragraph-records and sequence numbers 222. The query Q is now re-compared against this set of paragraphs R_(mod) 224, and the highest-ranking (lowest-cost) paragraph r in R_(mod) is identified 226, and its sequence number i_(r) is retrieved 228. Modified document S is now displayed to the user at paragraph i_(r) 230. Visualization of character match strengths may be effected as described above, using visual highlighting based on the bipartite graph computed for r in step 224.

[0121] If S has not changed, it is re-parsed into paragraph-records and sequence numbers 232. The sequence number i_(r) of r is retrieved 234, and document S is now displayed to the user at paragraph i_(r) 236. Visualization of character match strengths may be effected as described above, using visual highlighting based on the bipartite graph computed for r in step 200.

[0122] While the invention has been described with reference to the preferred embodiment thereof, it will be appreciated by those of ordinary skill in the art that modifications can be made to the structure and elements of the invention without departing from the spirit and scope of the invention as a whole.

We claim:
 1. A method of searching a database for a query that delivers feedback to the user consisting of computed match strengths of the individual characters comprising each retrieved database record.
 2. A method of searching a database as set forth in claim 1 that uses any technique or techniques of visual highlighting to visually represent to the user the relative match strengths of database characters, including but not limited to the use of colored fonts, typefaces of differing size, typefaces of differing style, and underlining.
 3. A method of searching a database for a query comprising the steps of: (a) providing a database of strings of characters; (b) providing a query string; (c) identifying polygraphs that occur in said query string and also in said database strings; (d) providing a match cost to each said identified polygraph; (e) positioning the query string relative to each database string; (f) matching polygraph occurrences in the query string with those in each database string, the cost of matching providing a numerical indication of the similarity between said query string and each said database string; (g) realigning said query string to reduce the cost by examining edges present in the matching solution; (h) repeating said matching step (f) and said realigning step (g) a predetermined number of times or until the cost of matching fails to improve; (i) repeating the steps (c) to (h) above for each database string for the purpose of identifying those database strings most similar to said query string; (j) computing the match strength of each character in each database string based upon the match cost of each matching polygraph that includes that character; and, (k) displaying the database strings most similar to the query string using highlighting techniques to visually represent relative match strengths of database characters.
 4. A method of searching a database for a query as set forth in claim 3 in which the polygraph costs provided in step (d) are weighted in inverse proportion to the length of the polygraph, thereby attenuating the effects of polygraph inclusion on matching step (f).
 5. A method of searching a database for a query as set forth in claim 4 in which steps (e) through (g) are performed a minimal number of times for each alignment in a set of initial alignment positions, which may include left-alignment, right-alignment, and a number of intermediate alignment points, and in which step (h) is replaced by identifying the initial alignment which has produced the best matching so far obtained, and repeating steps (e) through (g) for this initial alignment a predetermined number of times, or until the cost of matching fails to improve.
 6. A method of searching a database for a query as set forth in claim 5 in which matched database strings are penalized in proportion to the departure from left-alignment of the query alignment that produces the best matching.
 7. A method of searching a database for a query as set forth in claim 6 in which matched database strings are penalized in proportion to their length.
 8. A method of searching a database for a query as set forth in claim 7 further comprising the step of post-processing of the matchings computed for each database string in order to improve match quality, either by adjusting the contribution of some graph edges to the total match cost of a database string, or by making any other adjustments or alterations to the bipartite graph.
 9. A method of searching a database for a query as set forth in claim 8 further comprising a prefilter means for operating on a pre-computed index of polygraphs to eliminate database strings from further consideration.
 10. A method of searching a database for a query as set forth in claim 9 further comprising the step of operating on a pre-computed index of all polygraphs of a fixed length N, wherein each index entry is a list of all strings in the database that contain a particular N-graph.
 11. A method of searching a database for a query as set forth in claim 10 in which said prefilter means retrieves database strings from the index lists one list at a time, traversing the lists in order of increasing list length.
 12. A method of searching a database for a query as set forth in claim 11 in which said prefilter means retrieves database strings from the index lists in such a way as to keep track of how many distinct N-graphs each retrieved database string has in common with the query, returning preferentially those strings having the most N-graphs in common with the query.
 13. A method of searching a database for a query as set forth in claim 8 comprising the step of performing a search on a transduced version of the database using a transduced version of the query; or two parallel searches on transduced and non-transduced versions of the database using transduced and non-transduced versions of the query respectively, with the results of the two searches being merged.
 14. A method of searching a database for a query as set forth in claim 13 further comprising the step of identifying database strings occurring more than once in the merged output list and removing all occurrences but the most similar occurrence.
 15. A method of searching a database for a query as set forth in claim 14 in which the transductive preprocessing applied to query string and database strings produces a phonetic transcription of the query and the database.
 16. A method of searching as set forth in claim 14 further comprising searching non-alphabetic textual data by applying transductive preprocessing to database strings and query string that transforms non-alphabetic text into an alphabetic equivalent.
 17. A method of searching a database for a query as set forth in claim 8 where said database strings are paragraphs or other natural subdivisions of the documents in a document collection, and where the relation of each paragraph to its source document is preserved by associating with each paragraph a pointer or link to the source document, and the sequence number of the paragraph in the source document.
 18. A method of searching a database for a query as set forth in claim 17, in which the output is a list of matching links to source documents, together with the paragraph in the document that is most similar to the query, or the paragraph or paragraphs that are deemed most significant based on matching costs and/or relative location and distribution within the source document.
 19. A method of searching a database for a query as set forth in claim 18 which re-parses source documents and/or re-compares them with a query, such that when the user views the source document, said viewer is placed automatically at the position of a matching paragraph.